Skip to content

Conversation

@hmmr
Copy link
Contributor

@hmmr hmmr commented May 24, 2021

Instances of nodetool generate random node name prefixes to facilitate running multiple simultaneous calls in parallel. However, each time nodetool connects to the target node, a new atom is created on the latter. If this happens frequently and/or long enough, it will eventually crash the node as it hits the atom table limit. As a workaround, if the caller can guarantee calls are serialized and isolated in time, defining an env variable $NODETOOL_NODE_PREFIX will create identical atoms for node name prefix, thus avoiding generation of new atoms.

The proposed change is complimentary to #868, aiming to address the issue, reported by one of our customers, in which a riak node hit the atom table limit (yes, all of 1M+ entries) and crashed. A postmortem showed the table filled with [email protected], accumulated over a period of time resulting from calls to riak admin status every 5 min.

Note that I did not attempt to do any changes that may need to be done, to the same effect, in extended_bin_windows, as it's not straightforward for me which they would be (my knowledge of scripting in Windows is some 30 year old).

Instances of nodetool generate random node name suffxes to facilitate running multiple simultaneous calls in parallel.  However, each time nodetool connects to the target node, a new atom is created on the latter.  If this happens frequently and/or long enough, it will eventually crash the node as it hits the atom table limit.  As a workaround, if the caller can guarantee calls are serialized and isolated in time, defining an env variable $NODETOOL_NODE_PREFIX will create identical atoms for node name prefix, thus avoiding generation of new atoms.

The proposed change is complimentary to erlware#868, aiming to address the issue, reported by one of our customers, in which a riak node hit the atom table limit (yes, all of 1M+ entries) and crashed. A postmortem showed the table filled with `[email protected]`, accumulated over a period of time resulting from calls to `riak admin status` every 5 min.

Note that I did not attempt to do any changes that may need to be done, to the same effect, in extended_bin_windows, as it's not straightforward for me which they would be (my knowledge of scripting in Windows is some 30 year old).
@hmmr hmmr changed the title optionally allow static node name suffixes optionally allow static node name prefixes May 24, 2021
@tsloughter
Copy link
Member

Thanks. As with the other this issue is resolved when running on 23+ but I guess this solution is simple enough that I can't say no to the change.

This code will go away when 23.1 is the oldest OTP we support, but that won't be for a long time.

# are entirely sequential.
if [ x != x$NODETOOL_NODE_PREFIX ]; then
echo $NODETOOL_NODE_PREFIX
if [ x != x"$NODETOOL_NODE_PREFIX" ]; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not check if $NODETOOL_NODE_PREFIX is defined and not empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no strong opinions here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then please switch to using -z. That is how these are checked in the rest of the script, so good to be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using -z now.

Copy link
Contributor Author

@hmmr hmmr Jun 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsloughter Sorry for potentially dumb question, now that you have approved the PR even though some tests are failing, what do we do next? I'm asking this because there's an accompanying change I wanted to post to basho/riak, which would eventually allow operators to do NODETOOL_NODE_PREFIX=static42 riak admin status and never lose sleep over atom table limit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotta figure out why the tests fail so I can merge it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's sort of no reason for it; I see tests failing only on windows and this change being on the non-windows stuff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any progress on this? Much as I would be happy to help, I have no windows knowledge myself. Can anybody help?

@tsloughter
Copy link
Member

Oh, facepalm on windows it tries to compile rebar3 and I don't think that work son otp-19 anymore. I say we just remove windows otp-19 (and may as well remove otp-20) while adding otp-24.

Sorry its not related to your actual PR, but do you mind slipping that in here as another commit? It may be faster than me getting a new PR created, reviewed and merged with the change.

@tsloughter
Copy link
Member

Just change otp_version: [19, 20, 22, 23] to otp_version: [22, 23, 24] in .github/workflows/wintest.yml

@ferd
Copy link
Collaborator

ferd commented Jun 25, 2021

hah. Gleam's stuff for windows doesn't have 24 yet.

no otp-24 for windows yet
@tsloughter tsloughter merged commit fe7cd29 into erlware:master Aug 25, 2021
@hmmr hmmr deleted the patch-2 branch August 27, 2021 13:14
hmmr pushed a commit to TI-Tokyo/rebar3_cuttlefish that referenced this pull request Dec 27, 2021
hmmr pushed a commit to TI-Tokyo/rebar3_cuttlefish that referenced this pull request Dec 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants